PERF/BUG: reimplement MultiIndex.remove_unused_levels #16565
4231e0c to dbd9ddc
Codecov Report
@@ Coverage Diff @@
## master #16565 +/- ##
==========================================
- Coverage 90.75% 40.16% -50.6%
==========================================
Files 161 161
Lines 51073 51078 +5
==========================================
- Hits 46350 20513 -25837
- Misses 4723 30565 +25842
Codecov Report
@@ Coverage Diff @@
## master #16565 +/- ##
==========================================
- Coverage 90.75% 90.75% -0.01%
==========================================
Files 161 161
Lines 51095 51097 +2
==========================================
+ Hits 46370 46371 +1
- Misses 4725 4726 +1
looks good. pls add a 0.20.2 bug fix indexing (you can also say perf improvement there), whatsnew note. we r releasing tomorrow, so do today if you can.
pandas/core/indexes/multi.py
Outdated
@@ -1263,6 +1263,11 @@ def remove_unused_levels(self):

.. versionadded:: 0.20.0

Parameters
Indexes are immutable. We never use inplace. remove.
Um, are you sure? Index.set_names, MultiIndex.set_levels, and MultiIndex.set_labels already use inplace. And unlike the latter two functions, remove_unused_levels can't change the result of values, meaning it should be very safe to use in place since it won't change the semantics of any data structures using the index.
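To make the claim about values concrete, here is a minimal sketch (the index contents and variable names are made up for illustration, not taken from the PR). It only shows that the levels shrink while the tuples the index represents stay the same:

```python
import pandas as pd

# Illustrative data: a 2x3 product index, sliced so that some level
# values are no longer referenced by any row.
mi = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["first", "second"]
)
sliced = mi[:2]  # keeps only ("a", 1) and ("a", 2); levels are untouched

trimmed = sliced.remove_unused_levels()

# The levels shrink ...
print(list(sliced.levels[0]), list(trimmed.levels[0]))
# ... but the tuples returned by .values are the same.
print(list(sliced.values) == list(trimmed.values))
```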
yes i am sure
those are terrible patterns for an immutable object and should be removed
pandas/core/indexes/multi.py
Outdated
# nothing changed
if not changed.any():
    return self
if inplace:
just use shallow_copy here
pandas/tests/indexes/test_multi.py
Outdated
assert result2.is_(result)

def test_remove_unused_levels_large(self):
    # because tests should be deterministic:
add the issue number as a comment
pandas/tests/indexes/test_multi.py
Outdated
def test_remove_unused_levels_large(self):
    # because tests should be deterministic:
    rng = np.random.RandomState(4)  # chosen by fair dice roll.
don't use random state here
pandas/tests/indexes/test_multi.py
Outdated
assert len(result.levels[1]) < len(df.index.levels[1])
assert result.equals(df.index)

# in place
remove
result = df.index.remove_unused_levels()
assert len(result.levels[0]) < len(df.index.levels[0])
assert len(result.levels[1]) < len(df.index.levels[1])
assert result.equals(df.index)
also tests with different things for the levels, e.g. add another example with dates & strings
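A rough sketch of what a dates & strings case could look like (the data, mask, and names here are hypothetical, not the test that was eventually committed):

```python
import pandas as pd

# Hypothetical example data: a datetime level and a string level.
dates = pd.date_range("2017-01-01", periods=4, freq="D")
letters = ["a", "b", "c"]
mi = pd.MultiIndex.from_product([dates, letters], names=["first", "second"])

# Keep only the first two dates and the labels "a" and "c", so that
# both levels end up carrying values no row refers to.
mask = (mi.get_level_values("first") < pd.Timestamp("2017-01-03")) & (
    mi.get_level_values("second").isin(["a", "c"])
)
filtered = mi[mask]

result = filtered.remove_unused_levels()
assert len(result.levels[0]) < len(filtered.levels[0])
assert len(result.levels[1]) < len(filtered.levels[1])
assert result.equals(filtered)
```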
assert len(result.levels[0]) < len(df.index.levels[0])
assert len(result.levels[1]) < len(df.index.levels[1])
assert result.equals(df.index)
also assert that this IS equal to .reset_index(..).set_index(...).index (for each example)
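For reference, a small sketch of that round-trip check, assuming a toy frame whose column name and filter are made up for illustration. Rebuilding the index from columns produces levels that only contain values actually present, which is what remove_unused_levels should match:

```python
import pandas as pd

# Illustrative frame: 3x3 product index, then a filter that leaves
# unused values in both levels.
mi = pd.MultiIndex.from_product(
    [[0, 1, 2], [10, 20, 30]], names=["first", "second"]
)
df = pd.DataFrame({"value": range(len(mi))}, index=mi)
df = df[df["value"] < 2]  # keeps (0, 10) and (0, 20)

result = df.index.remove_unused_levels()

# Rebuild the index from plain columns; its levels contain only the
# values that occur in the remaining rows.
expected = df.reset_index().set_index(["first", "second"]).index

pd.testing.assert_index_equal(result, expected)
```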
8a9fe43 to 62c5907
@jreback, I removed (Oh look, it just failed like this in CI.)
pandas/tests/indexes/test_multi.py
Outdated
assert result.equals(df.index)

expected = df.reset_index().set_index(['first', 'second']).index
assert result.equals(expected)
I think I should replace this with the stronger tm.assert_index_equal(). I plan to do so when updating this test based on what you tell me to do about the non-determinism.
use parameterize instead of doing it this way: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#using-pytest
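A possible shape for the parametrized version, as a sketch only. The test name, the parameter sets, and the filter are hypothetical and not the code that was merged:

```python
import pandas as pd
import pytest


# Each tuple supplies the values for the two index levels (illustrative data).
@pytest.mark.parametrize(
    "first, second",
    [
        ([0, 1, 2, 3], [10, 20, 30]),
        (["a", "b", "c", "d"], ["x", "y", "z"]),
        (pd.date_range("2017-01-01", periods=4), ["x", "y", "z"]),
    ],
)
def test_remove_unused_levels_roundtrip(first, second):
    mi = pd.MultiIndex.from_product([first, second], names=["first", "second"])
    df = pd.DataFrame({"value": range(len(mi))}, index=mi)
    df = df[df["value"] < 5]  # only the first two values of level 0 survive

    result = df.index.remove_unused_levels()
    expected = df.reset_index().set_index(["first", "second"]).index

    assert len(result.levels[0]) < len(df.index.levels[0])
    assert result.equals(expected)
```

pytest then reports each parameter set as its own test case, so a failure points at the level types that triggered it.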
oh that's fine. wasn't really sure why you were explicitly setting the random state. yes, sometimes you will not remove levels based on the filter, so ok putting it back.
62c5907 to 9852c6f
@rhendric lgtm. can you add an asv for this? (asv/benchmarks/indexing.py is ok.) you can use your test example (just the int/int is fine). and pls post the result.
* Add a large random test case for remove_unused_levels that failed the previous implementation
* Fix pandas-dev#16556, a performance issue with the previous implementation
* Always return at least a view instead of the original index
I'm new to asv; what is the result you want me to post? Console output? A file somewhere?
9852c6f to 83bdc59
implement a suitable asv. then run it and post the results. see http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite
@@ -248,6 +254,9 @@ def time_multiindex_small_get_loc_warm(self):
def time_is_monotonic(self):
    self.miint.is_monotonic

def time_remove_unused_levels(self):
yep!
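For readers unfamiliar with asv, a benchmark is just a class whose setup() prepares state and whose time_* methods get timed. The sketch below is a standalone illustration: the class name, sizes, and slicing are assumptions, and it is not the benchmark body added in this PR (the hunk above only shows the method signature).

```python
import numpy as np
import pandas as pd


class MultiIndexRemoveUnusedLevels:
    """Hypothetical asv-style benchmark; asv runs setup() before timing."""

    def setup(self):
        n = 1000  # illustrative size, chosen arbitrarily
        mi = pd.MultiIndex.from_product(
            [np.arange(n), np.arange(n)], names=["first", "second"]
        )
        # Take every 10th entry of the first 10 * n rows: only 10 values
        # of the first level and 100 of the second remain in use.
        self.mi = mi[: 10 * n : 10]

    def time_remove_unused_levels(self):
        # asv measures how long this call takes.
        self.mi.remove_unused_levels()
```

Outside of asv the class can be exercised by hand by calling setup() and then the time_* method.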
ok that's slightly better 😄
thanks!
(cherry picked from commit 8d092d9)
Version 0.20.2
* tag 'v0.20.2': (68 commits)
  RLS: v0.20.2
  DOC: Update release.rst
  DOC: Whatsnew fixups (pandas-dev#16596)
  ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460)
  BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444)
  PERF: vectorize _interp_limit (pandas-dev#16592)
  DOC: whatsnew 0.20.2 edits (pandas-dev#16587)
  API: Make is_strictly_monotonic_* private (pandas-dev#16576)
  BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)
  Strictly monotonic (pandas-dev#16555)
  ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026)
  fix linting
  BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244)
  BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317)
  return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486)
  BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549)
  BUG: Fixed tput output on windows (pandas-dev#16496)
  Strictly monotonic (pandas-dev#16555)
  BUG: fixed wrong order of ordered labels in pd.cut()
  BUG: Fixed to_html ignoring index_names parameter
  ...
* Add a large random test case for remove_unused_levels that failed the previous implementation
* Fix #16556 (PERF: remove_unused_levels is very slow), a performance issue with the previous implementation
* Add inplace functionality (removed during review)
* Always return at least a view instead of the original index